tokenizers

Version: 0.7.0
Source: npm

Version published: 5 years ago

Weekly downloads: 537; increased by 6.55%

Maintainers: 3

Created: 6 years ago

Node.js implementation of today's most used tokenizers, with a focus on performance and versatility. These are bindings over the Rust implementation, which is also where the high-level design is described.

Main features

Train new vocabularies and tokenize using four pre-made tokenizers (BERT WordPiece and the three most common BPE versions).
Extremely fast (both training and tokenization), thanks to the Rust implementation. Tokenizing a gigabyte of text takes less than 20 seconds on a server CPU.
Easy to use, but also extremely versatile.
Designed for research and production.
Normalization comes with alignment tracking. It is always possible to get the part of the original sentence that corresponds to a given token.
Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
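
The pre-processing steps named in the last feature (truncation, padding, special tokens) can be illustrated without the library. The sketch below is not the package's API: the helper name prepare, the maxLength parameter, and the [CLS], [SEP] and [PAD] strings are assumptions chosen for illustration, following the usual BERT convention. It also assumes padding positions are flagged in the special-tokens mask.

```typescript
// Hypothetical helper, not part of the tokenizers package.
interface Prepared {
  tokens: string[];
  attentionMask: number[];
  specialTokensMask: number[];
}

function prepare(tokens: string[], maxLength: number): Prepared {
  if (maxLength < 2) {
    throw new Error("maxLength must leave room for [CLS] and [SEP]");
  }
  // Truncate so that the two special tokens still fit.
  const kept = tokens.slice(0, maxLength - 2);
  const out = ["[CLS]", ...kept, "[SEP]"];
  const realLength = out.length;
  // Pad on the right up to maxLength.
  while (out.length < maxLength) {
    out.push("[PAD]");
  }
  return {
    tokens: out,
    // 1 for real tokens (including [CLS] and [SEP]), 0 for padding.
    attentionMask: out.map((_, i) => (i < realLength ? 1 : 0)),
    // 1 for [CLS], [SEP] and (by assumption in this sketch) padding.
    specialTokensMask: out.map((_, i) => (i === 0 || i >= realLength - 1 ? 1 : 0)),
  };
}

const padded = prepare(["who", "is", "john", "?"], 8);
// padded.tokens        -> [CLS] who is john ? [SEP] [PAD] [PAD]
// padded.attentionMask -> 1 1 1 1 1 1 0 0
```

A fixed maxLength keeps every sequence in a batch the same length, which is why truncation and padding are usually configured together.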

Installation

npm install tokenizers@latest

Basic example

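import { BertWordPieceTokenizer } from "tokenizers";

// Top-level await requires an ES module context.
const wordPieceTokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "./vocab.txt" });
const wpEncoded = await wordPieceTokenizer.encode("Who is John?", "John is a teacher");

console.log(wpEncoded.length); // number of tokens in the encoding
console.log(wpEncoded.tokens); // token strings
console.log(wpEncoded.ids); // vocabulary ids of the tokens
console.log(wpEncoded.attentionMask); // 1 for real tokens, 0 for padding
console.log(wpEncoded.offsets); // character offsets of each token in the original text
console.log(wpEncoded.overflowing); // encodings for anything cut off by truncation
console.log(wpEncoded.specialTokensMask); // 1 for special tokens, 0 otherwise
console.log(wpEncoded.typeIds); // which input sequence each token belongs to
console.log(wpEncoded.wordIndexes); // index of the source word for each token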
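
The offsets printed in the example are what make alignment tracking usable in practice. As a minimal sketch, assuming each offset is a [start, end) character range into the original string, the helper below (a hypothetical name, spans, not part of the package) recovers the source text for each token. The offsets are written by hand for illustration; a real run would take them from the encoding.

```typescript
type CharSpan = [number, number];

// Hypothetical helper: map each [start, end) offset back to the source text.
function spans(source: string, offsets: CharSpan[]): string[] {
  return offsets.map(([start, end]) => source.slice(start, end));
}

// Hand-written offsets for "Who is John?", not produced by the library.
const sampleText = "Who is John?";
const sampleOffsets: CharSpan[] = [[0, 3], [4, 6], [7, 11], [11, 12]];
const recovered = spans(sampleText, sampleOffsets);
// recovered -> ["Who", "is", "John", "?"]
```

Because the slices come from the original string, they keep its casing and accents even when the tokenizer's normalizer lowercases or strips them.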

Provided Tokenizers

BPETokenizer: The original BPE
ByteLevelBPETokenizer: The byte level version of the BPE
SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece
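
To give a feel for what WordPiece does to a single word, here is a minimal sketch of the greedy longest-match-first lookup used by BERT-style tokenizers. It is not the package's implementation: the function name wordPiece, the toy vocabulary, and the "##" continuation prefix and "[UNK]" token are assumptions that follow the common BERT convention.

```typescript
// Hypothetical sketch: greedy longest-match-first WordPiece on one word.
function wordPiece(word: string, vocab: Set<string>, unk: string = "[UNK]"): string[] {
  const pieces: string[] = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let match: string | null = null;
    // Try the longest remaining substring first, then shrink from the right.
    while (end > start) {
      const candidate = (start > 0 ? "##" : "") + word.slice(start, end);
      if (vocab.has(candidate)) {
        match = candidate;
        break;
      }
      end -= 1;
    }
    // If nothing matches at this position, the whole word becomes the unknown token.
    if (match === null) {
      return [unk];
    }
    pieces.push(match);
    start = end;
  }
  return pieces;
}

// Toy vocabulary for illustration only.
const toyVocab = new Set<string>(["token", "##izer", "##s", "un"]);
const split = wordPiece("tokenizers", toyVocab);
// split -> ["token", "##izer", "##s"]
```

The BPE variants in the list differ in how the vocabulary is learned and applied (by merging frequent pairs), so this sketch applies to the WordPiece tokenizer only.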

License

Apache License 2.0

Package last updated on 01 Jul 2020
